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ABSTRACT 

Deep sequencing of clinical samples is now an established 
tool for the detection of infectious pathogens, with direct 
medical applications. The large amount of data generated 
provides an opportunity to detect species even at very low 
levels, provided that computational tools can effectively 
interpret potentially complex metagenomic mixtures. Data 
interpretation is complicated by the fact that short 
sequencing reads can match multiple organisms and by the 
lack of completeness of existing databases, in particular 
for viral pathogens. This interpretation problem can 
be formulated statistically as a mixture model, where 
the species of origin of each read is missing, but the 
complete knowledge of all species present in the mixture 
helps with the individual reads assignment. Several 
analytical tools have been proposed to approximately 
solve this computational problem. Here, we show that 
the use of parallel Monte Carlo Markov chains (MCMC) 
for the exploration of the species space enables the 
identification of the set of species most likely to contribute 
to the mixture. The added accuracy comes at a cost 
of increased computation time. Our approach is useful 
for solving complex mixtures involving several related 
species. We designed our method specifically for the 
analysis of deep transcriptome sequencing datasets and 
with a particular focus on viral pathogen detection, 
but the principles are applicable more generally to all 
types of metagenomics mixtures. The code is available 
on github (http://github.com/smorfopoulou/metaMix) and 
the process is currently being implemented in a user 
friendly R package (metaMix, to be submitted to CRAN). 



INTRODUCTION 

Metagenomics can be defined as the analysis of a collection 
of DNA or RNA sequences originating from a single sample. 
In practice, its scope is broad and includes the analysis of 
a diverse set of samples such as gut microbiome (HJ, (f2]l, 
environmental (0! or clinical (|4]), ©, © samples. Among 
these applications, the discovery of viral pathogens is clearly 
relevant for clinical practice ©, ©. The traditional process 
of characterizing a virus through potentially difficult and 
time consuming culture techniques is being revolutionized by 
advances in high throughput sequencing. Potential benefits 
of sequence driven methodologies include a more rapid 
turnaround time (9), combined with a largely unbiased 



approach in species detection, including the opportunity for 
unexpected discoveries. 

The analysis of shotgun sequencing data from metagenomic 
mixtures raises complex computational challenges. Part of the 
difficulty stems from the read length limitation of existing 
deep DNA sequencing technologies, an issue compounded 
by the extensive level of homology across viral and 
bacterial species. Another complication is the divergence of 
the microbial DNA sequences from the publicly available 
references. As a consequence, the assignment of a sequencing 
read to a database organism is often unclear. Lastly, the 
number of reads originating from a disease causing pathogen 
can be low ( 10). The pathogen contribution to the mixture 
depends on the biological context, the timing of sample 
extraction and the type of pathogen considered. Therefore, 
highly sensitive computational approaches are required. 

A first analytical problem is read classification, that is 
the assignment of a given sequencing read to a species. 
Several tools have been developed and these belong to two 
broadly defined classes: composition-based and similarity- 
based approaches. The read classification based on sequence 
composition relies on the intrinsic features of the reads, 
such as CG content or oligonucleotide distributions. Methods 
include PhyloPythia (TTTT) . Phymm and PhymmBL (fT2l) . 
MetaCluster (Q~3]l. These tend to focus on major classes 
in a dataset and may not perform well on low-abundance 
populations dT4l . Additionally, results are usually reliable for 
longer reads only Sl2\ . 

Similarity based methods, using homology search 
algorithms such as BLAST (15), are considered the most 
effective methods for read classification ( fT2t . at least when 
other species from the same genus are known. One of the most 
popular tools using the output of a similarity search algorithm 
is MEGAN (fT6b . MEGAN addresses ambiguous matches by 
assigning reads that have multiple possible assignments to 
several species to the taxonomic group containing all these 
species, or else their lowest common ancestor (LCA). This 
approach is accurate on a higher taxonomic level. However, it 
is lacking a formal solution to resolving ambiguous matches. 
A weakness of the similarity based methods is the appearance 
of a long tail of species, supported only by a few reads. This 
results from the classification being decided one read at a 
time, in contrast to considering all reads simultaneously. 
A different approach specifically focusing on the relative 
abundance estimation task using a non-negative LASSO 
approach is implemented in GASiC (fTTT ). This method 
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corrects the observed alignment-based abundances using the 
reference genomes similarities. 

Methods focused on the statistical inference of the set of 
present species as well as the estimation of their relative 
proportions, incorporate knowledge from all reads to assign 
each individual read to a species. From a statistical standpoint, 
this identification and quantification question can be thought 
of as an application of mixture models. These ideas have 
been applied in the metagenomics context in frequentist 
(GRAMMy £[8]>) and Bayesian (Pathoscope (O) settings. 
GRAMMy formulates the problem as a finite mixture model, 
using the Expectation-Maximization (EM) algorithm to 
estimate the relative genome abundances. Pathoscope refines 
this process by penalizing reads with ambiguous matches in 
the presence of reads with unique matches and enforcing 
parsimony within a Bayesian context. Both methods work 
with unassembled sequence data and they are not currently 
setup to incorporate an initial short read assembly step, which 
could be achieved by assigning a higher weight to contigs 
formed by multiple reads. 

A mixture model is concerned with the species relative 
abundance estimation, as well as the read to species 
assignment. A related but distinct question concerns the set 
of species which should be included in the mixture model. 
This question is closely related to the biological question of 
asking what species are present in the mixture. Including all 
species flagged as potential match by the read classification 
can introduce a large number of species (often > 1.000). 
Mixture models will, in this situation, identify a large number 
of species at low levels. This interpretation is appropriate in 
some applications. In many other cases, the expectation is that 
the underlying species set should be parsimonious and that 
some divergence with database species or sequencing errors 
can explain a large fraction of the non matching reads. 

Hence, a better statistical formulation of the community 
profiling problem is the exploration of the candidate 
organisms state-space. In this context, non nested models can 
be compared based on their marginal likelihood. Within this 
Bayesian framework, readily interpretable probabilities, such 
as the posterior probabilities of species sets can be used to 
quantify the support for a species in the mixture. Finally, more 
complex hypotheses regarding for example the number of viral 
species or the joint presence of two distinct organisms can be 
investigated. 

The main challenge behind such a formulation is 
computational. For a large number of potential species, there 
is a combinatoric explosion of the potential subsets to be 
considered. Efficient computational strategies are required to 
make this problem tractable. Here we show that this inference 
can be achieved for modern scale metagenomics datasets. Our 
strategy is based on parallel tempering, a Monte Carlo Markov 
Chain technique, which enables the use of parallel computing 
to speed up the inference. Our tool are being implemented in 
a user friendly R package (metaMix) that produces posterior 
probabilities for various models and the relative abundances 
under each model. We demonstrate its value using several 
clinical and artificial datasets. 



MATERIALS AND METHODS 
Bioinformatics pre-processing 

Prior to running the mixture model for metagenomic 
profiling, several steps are required to process the short 
read sequence data (Figure [TJ. The pipeline uses publicly 
available bioinformatics tools for each preprocessing step. 
This bioinformatics pipeline has been automated and the 
scripts are available for public use (http : / / github . com/ 
|smorf opoulou/metaMix| l. 



1 . Remove clonal reads - 
in house C++ 



2. QC & trimming - PRINSEQ 



I 



3. Remove human seqs- 
a. Novoalign b. BLASTn 

I 

5. R emove rRNA seqs - BLASTr T^ 



6. Assembly - Velvet 

(7. Contigs & unassembled reads - \ 
BLASTx J 

\ 



Species profiling - 
Bayesian Mixture Model Averaging 



Figure 1. Pipeline steps for species identification. The step of removing the 
rRNA sequences is applicable only when the aim is viral discovery. 



The first step is the removal of clonal reads using an 
in house C++ script. We then use PRINSEQ d20b to apply 
read based QC (low quality and complexity) and 3'end 
trimming. For metagenomic analysis of human samples, reads 
originating from the human host are not relevant for our 
research question. We therefore remove human host reads, 
using a two-step approach to limit computation time: initially 
a short read aligner (novoalign, www.novocraft.com), 
followed by BLASTn. The next step is only applicable when 
the focus is on virus discovery using transcriptome reads. We 
remove ribosomal RNA sequences, using BLASTn against the 
Silva rRNA database (http://www.arb-silva.de/). 

The remaining reads are assembled into contigs using the 
Velvet short read assembler d2TT l. For each contig we record 
the number of reads required for its assembly, using this 
information at the stage of species abundance estimation. A 
Velvet tuning parameter is the user defined fc-mer length that 
specifies the extent of overlap required to assemble read pairs. 
Metagenomic assembly is not a straightforward task, as short 
k-mers work best with the low abundance organisms, while 
long k-mers with the highly abundance ones. The shorter the 
k-mer the greater the chance of spurious overlaps, hence we 
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choose relatively high fc-mer length, in order to avoid chimeric 
contigs. 

For each contig and un-assembled read we record the 
potentially originating species, using the nucleotide to protein 
homology matching tool BLASTx. We use BLASTx due to 
the higher level of conservation expected at the protein level 
compared to nucleotides. This choice is guided by our focus 
in viral pathogens and viruses have high genetic diversity and 
divergence |Q. 

This step generates a sparse similarity matrix between the 
read sequences and the protein sequences (species as columns, 
reads and contigs as rows), which is stored using the R 
package Matrix to handle sparse matrices. 

The statistical method described in the remainder of 
this section considers the competing models that could 
accommodate our observed data, that is the BLASTx results 
and compares them. The different models represent different 
sets of species being present in the sample. The method works 
on two levels of inference: in the first instance we assume a 
set of species to be present in the sample and we estimate this 
model's parameters given the data. The other level of inference 
is the model comparison so as to assess the more plausible 
model. The process is iterated in order to explore the model 
state space. 

Model specification assuming a fixed set of species 

Assuming a given set of K species from which the reads can 
originate, the metagenomic problem can be summarized as a 
mixture problem, for which the assignment of the sequencing 
reads to species is unknown and must be determined. The data 
consist of N sequencing reads X = {xi,...,xj^), and for a 
given read x^ the likelihood is written as: 



K 



p{x i \w 1 K)=p(x i 



■■^wjfjixi) 



(1) 



where w = (w\,...,wk) represent the proportion of each 
of the K species in the mixture. These mixture weights are 
constrained such that 0<Wj<l and J2j w j — 1- m practice, 
we also add a category (species K + l) which we refer to as 
the "unknown" category, and captures the fact that some reads 
cannot be assigned to any species. 

Additionally fj{xj) = P(x{\xi from species j) = py is 
the probability of observing the read X{ conditional on the 
assumption that it originated from species j. We model this 
probability using the number of mismatches m between the 
translated read sequence and the reference protein sequence 
and a Poisson distribution with parameter A for that number 
of mismatches. As the read x\ could start from any point in 
the reference protein sequence, we write: 



Pij 



Pois(m;A) 



(2) 



where l p is the length of the reference protein. 

Therefore for a given set of K species, the p„- probabilities 
are regarded as known and the mixture weights will be 
estimated. 



Combining the above we see that when we know the set 
of species K, the mixture distribution gives the probability of 
observing read xf Y^j=i w jPij! namely equation alt. 

We therefore write the likelihood of the dataset A as a sum 
of K n terms: 



K 



(3) 



= 1 3=1 



Estimation of mixture weights 

Assuming a fixed set of species, the posterior probability 
distribution of the weights w given the read data X is: 



P(w\X): 



P(X\w)n{w) Ui[Y l j w jPijM w ) 

p ( x ) IUi[J2j w jPijU( w ) dw 



(4) 



A practical prior for the mixing parameters w is the 
Dirichlet distribution owing to its conjugate status to the 
multinomial distribution. Despite the use of conjugate priors, 
the probabilistic assignment of reads to species involves 
the expansion of the likelihood into K n terms which is 
computationally infeasible through direct computation. An 
efficient estimation can be performed by the introduction 
of unobserved latent variables that code for the read 
assignments. In this framework, either the Gibbs sampler d22b . 
a Monte Carlo Markov Chain technique, or the Expectation- 
Maximization (EM) (23) algorithm can be used to estimate 
the mixture weights w. EM returns a point estimate for 
w while the Gibbs sampler the distribution of w. Both 
methods were implemented and provided comparable results 
(Supplementary Text for details of implementation). 

Marginal likelihood estimation 

Each combination of species corresponds to a finite mixture 
model for which the marginal likelihood can be estimated. 
Marginal likelihood comparison has a central role in 
comparing different models {Mi,...,M m }. To compute the 
marginal likelihood P(X\M k ) for the mixture model M k one 
has to average over the parameters with respect to the prior 
distribution 7r(0fe|Mfc), where 9 k are the model parameters: 

P(X\M k )= f P(X\e k ,M k )ir(9 k \M k )d9 k (5) 
The posterior probability of the model M k is: 



P{M k \X)^P{X\M k )P{M k ) 



(6) 



where P(M k ) is the prior belief we hold for each model. 
The prior can be specified depending on the context but the 
basis of our interpretation is that parsimonious models with 
a limited number of species are more likely. Thus in this 
Bayesian framework our default prior uses a penalty limiting 
the number of species in the model. 

We approximate this penalty factor based on a user-defined 
parameter r that represents the species read support required 
by the user to believe in the presence of this species. We 
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compute the logarithmic penalty value as the log-likelihood 
difference between two models: one where all N reads belong 
to the "unknown" category and one where r reads have a 
perfect match to some unspecified species and the remaining 
N — r reads belong to the "unknown" category. The pij 
probabilities for the r reads originating from this unspecified 
species are approximated by 1 /(median protein length in the 
reference database). This parameter essentially reflects how 
many reads are required to provide credible support that a 
species is present in the mixture and acts as a probabilistic 
threshold as opposed to a deterministic one applied on a 
ranked list. 

From now on, when we refer to the marginal likelihood, 
we mean the marginal likelihood for a specific model and 
we forego conditioning on the model M k in the notation. 
Additionally, in our mixture model pij are always regarded 
as known, therefore the model parameters 6 k are the mixture 
weights w. Hence |5} becomes: 



P(X)= [ P(X\w)w(w)dw® f Y[[J2w jPij }7T{w)dw 

i 3 

(7) 

Approximating the marginal likelihood is a task both 
difficult and time-consuming. We chose the Defensive 
Importance Sampling technique (l24l l for the relative simple 
implementation compared to other approaches. This is crucial 
as we perform this approximation numerous times, for every 
species combination we consider. For more details on IS, see 
the Supplementary Text. 

However the goal of this work is to deliver results in a 
clinical setting within an actionable time-frame. We wish 
to speed up the computation without compromising the 
accuracy and the sensitivity of the results. For that reason, 
we use a point estimate of the marginal likelihood by 
means of the Expectation-Maximization (EM) algorithm. The 
different approaches were used on a benchmark dataset. The 
resulting taxonomic assignment as well as the species relative 
abundance estimates were similar between them, with the EM 
approach resulting in a 13-fold speed increase (Supplementary 
Text). 

Model comparison: exploring the set of present species 

We use a Monte Carlo Markov Chain (MCMC) to explore 
the set of present species of size 2 s — 1, where S is the total 
number of potential species. In practice we observe that S can 
be greater than 1,000. The MCMC must explore the state- 
space in a clinically useful timespan. Therefore we reduce 
the size of the state-space, by decreasing the number of S 
species to the low hundreds. We achieve this by fitting a 
mixture model with S categories, considering all potential 
species simultaneously. Post fitting, we retain only the species 
categories that are not empty, that is categories that have at 
least one read assigned to them. 

Let us assume that at step t, we deal with a set of species that 
corresponds to the mixture model M k . At the next step (t + 1), 
we either add or remove a species and the new set corresponds 
to the mixture model Mj. The step proposing the model Mi is 



accepted with probability: 

AlMh — > Mi =min\ 1, t-t ; r \ 

V ' X P(X\M k )WP(M k ) q(M k ^Mi) i 

(8) 

where q(M~i — ?> M k ) is the probability of transitioning from 
model M\ to model M k . In other words, this is the probability 
of adding or removing the species to the M k set of species that 
took us to the M; set of species. 

If the step is accepted, then the chain moves to the new 
proposed state M/. Otherwise if not accepted, the chain's 
current state becomes the previous state of the chain, i.e the 
set of species remains unchanged. 

metaMix outputs log-likelihood traceplots so that the user 
can visually inspect the mixing of the chain. The default 
setting is to discard the first 20% of the iterations as burn- 
in. We concentrate on the rest to study the distribution over 
the model choices and perform model averaging. We can 
then summarize appropriately the posterior distribution and 
answer the important questions of interest. Examples of such 
questions include: what species have probability p or greater 
being included in the set of present species? what is the 
probability of having the n specific closely related strains 
in the set of present species? Depending on the biological 
context, one may ask numerous similar or other case-specific 
questions. 

Optimized implementation: parallel tempering 

We observed that simple MCMC does not efficiently explore 
the complex model state space, as evidenced by the poor 
mixing of the chain (Figure [2]). 

In order to overcome this and take advantage of parallel 
computing, we run multiple chains and allow exchange moves 
between them. This method is called parallel tempering 
MCMC d25t . Within the parallel setting, each chain simulates 
from the posterior distribution g(M)=P(M\X) raised to a 
temperature Te(0,l], where model M represents a set of 
species being present. The different temperature levels result 
in tempered versions of the posterior distribution P{M k \X) T . 
When T = 1 the draws are from the posterior distribution. On 
the other hand, at higher temperatures the posterior spreads 
out its mass and becomes flatter. In practice that means 
that distributions at higher temperatures are easily sampled, 
improving the mixing. We are interesting in studying the 
original posterior distribution with T= 1. 

We implemented two types of moves. The first is the 
mutation step, which simply is the within chain move we 
described in the previous section. This is accepted with 
probability given by |8|. The other is the exchange step, 
a between chains move. This Metropolis-Hastings move 
proposes to swap the value of two chains k and k + 1, adjacent 
in terms of T. Suppose that the values of the two chains are 
M k and Ai^+i respectively, corresponding to two different 
sets of species. The move is accepted with probability i 



A- 



■ f , 9k(M k+ i) 9k+l{ M k) ~i , m 
-min\l, — — — — r\ (9) 

Since g k (M k )=P(M k \X) T and g k+1 (M k+1 ) = 
P{M k+ i\X) TJrl , it follows that when M k+ i represents 
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Figure 2. a. Log-like lihood trace plot for single chain MCMC and b. for PT chain at temperature T=l. c. Schematic of parallel 
tempering. Exchanges are attempted between chains of neighboring temperatures, where Chainl at T\ = 1, T\ < T2 < T3 < T4. 



a set of species of higher probability than the one 
represents, the exchange move will always be accepted 
(Figure |2j. 

This allows moves between separate modes, ensuring a 
global exploration of the model state space. Eventually "hot" 
and "cold" chains will progress towards a global mode. 



RESULTS 

We first applied our method to a reference artificial dataset, for 
which the community composition and the read assignment is 
known. We then analyzed two clinical samples generated for 
diagnosis purpose. 

We compare our results with the ones produced by MEGAN 
version 5.3 and Pathoscope 2.0. Both methods are similarity- 
based. This property and more specifically their flexibility 
to work with BLASTx output, makes them better candidates 



for viral discovery compared to other composition-based 
methods. 

From the mixture model methods, we have chosen 
Pathoscope. We were also interested in comparing our 
results to the ones by GRAMMy, which was the first 
metagenomic method to use the idea of the mixture model 
method. However it can work only with nucleotide-nucleotide 
comparisons (BLASTn), which is suboptimal for viral 
discovery. Additionally, it works only with unassembled reads 
and it requires that these are of the same fixed length. Lastly, 
a time consuming pre-structured genome and taxon directory 
is required. For these reasons, GRAMMy was not included in 
the comparison. 

Default parameters were used for all methods, unless 
stated otherwise. For metaMix the organisms reported have a 
posterior probability greater than 0.9. For a detailed discussion 
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on default settings and practical considerations for metaMix, 
please see the Supplementary Text. 

It is important to state that there is a trade-off between 
the value of the read support parameter and the number of 
present species in the output summary table. The user's choice 
can be informed by the biological context at each instance. 
Familiarity with the studied community allows for a better 
understanding on what constitutes real signal and what noise. 

As an example, for the typical human clinical sample where 
the sample collection might have occurred some time after 
the infection has taken place, a low value in order to adopt a 
sensitive approach makes sense. This is because it is plausible 
to expect traces of the infecting pathogen in such a sample. We 
set the default value for viral identification in human clinical 
samples to be r = 10. 

In a highly complex environmental metagenomic 
community where there is a plethora of species of similar 
abundances, the choice becomes less straightforward 
especially in the case of closely related strains. A large r 
value can result in the method merging together strains that 
are differentiated by fewer reads than r. On the other hand 
a low r can have the opposite effect, whereby the methods 
splits a strain into two or more strains, by moving a few reads 
from one strain to a very similar one with which they have 
equally good matches. We set the default value for general 
community profiling in environmental samples to be r = 30, 
however we report comparative results for different values. 

FAMeS simHC dataset - closely related strains 

The FAMeS artificial datasets (fhttp: //fames .| 
| jgi-psf .org/ description . html) , are simulated 
metagenomic datasets composed of random reads from 
113 isolate microbial genomes present in IMG (Integrated 
Microbial Genomes) and sequenced at the DOE Joint Genome 
Institute. They are a popular choice to use as benchmark 
datasets for various metagenomics methods. Their suitability 
rises from the fact that the number of species that form the 
metagenomic community is known as well as their relative 
abundances. The FAMeS datasets have been designed to 
model real metagenomic communities in terms of complexity 
and phylogenetic composition. 

There are three datasets: simHC, simMC, simLC 
corresponding to high, medium and low complexity of the 
metagenomic community respectively. Three methods were 
applied to simHC, as this is the highest complexity dataset, 
with many closely related strains with similar abundances 
and no dominant species. The lowest abundance is 255 reads 
out of 118,000 reads. The bioinformatics processing in this 
instance consisted of a BLASTn comparison to all NCBI 
bacterial genomes. The number of genomes mapped, retrieved 
from the the BLASTn output was —2,500. 

metaMix outperforms Pathoscope and MEGAN in the 
community profiling task and consequently in the relative 
abundance estimation, as shown in the results below. 

metaMix To limit the complexity of the fit, we used a two step 
procedure described in the Methods section. We first fitted the 
mixture model with the complete set of 2,500 species and a 
limited run length of 500 iterations. Based on this analysis, 
we identified 1,312 species supported by at least one read 



and explored this state space. To limit the computational time, 
we also considered a stronger approximation, including only 
the 374 potential species supported by at least 10 sequencing 
reads. Both approaches generated similar results, albeit the 
more complex one with 1,312 potential species required 
the quadruple of the computation time (12h instead of 3h). 
metaMix identified 116 species, detecting successfully all the 
members of the metagenomic community (Table SI). These 
were detected on the strain level except in four instances where 
a different strain of the same species, or different species 
within the same genus was detected. There were four false 
positives (Table SI). 

Pathoscope Pathoscope identified 47 species. Of these 42 are 
members of the metagenomic community. 42 are the exact 
same strain, while 3 are either the same species but different 
strain or same genus but different species. However it fails 
to detect 68 species that are actually present in the mixture. 
Tuning the parameter that enforces the parsimonious results 
(any thetaPrior greater than 10), thereby removing the unique 
read penalty Pathoscope behaves as a standard mixture model 
and identifies 165 species. With these settings, it identifies 
all but one members of the community. The organisms are 
identified at the strain level, except in three instances where 
it identified different species within the same genus. The 
major interpretation issue is the presence of a long tail of 
species (54 species) that are actually not present in the mixture 
(Supplementary Table SI). Pathoscope produced the results in 
one minute. 

MEGAN MEGAN identified 232 species. It discovered all 
original species of the community on the strain level, except 
for 9 instances where it identified a lowest common ancestor 
(LCA). Aside from the lack of strain/species specificity for 
8% of the community members, the main issue is the long tail 
of false positives species. In the species summary provided by 
MEGAN, there are 1 19 species which are not actually present, 
but supported by a sufficient number of reads (that is 50 reads) 
for MEGAN to include these in the output. It produced the 
results in less than one minute. 

Table 1. Number of species identified for the FAMeS simHC 
dataset. 



True 


MEGAN 


Pathoscope 


metaMix 


113 


232 


165 


116 



Our primary aim is a diagnostic for the presence/absence 
of species in the mixture. As a secondary aim, we are also 
interested in estimating accurately the relative abundance of 
the present organisms. We can assess the abundance estimates 
produced by the methods by using error measures such as 
the relative root mean square error, RRMSE and the average 
relative error, AVGRE. For all methods, when the exact strain 
was not identified but the correct species or genus was, we 
used this abundance. 



RRMSE = 



\ 



1 K i 

3=1 3 



(10) 
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AVGRE 



1 

K 



K I 



t 



Table 3. Number of species in the simHC FAMeS dataset by 
metaMix, Pathoscope and MEGAN 



(11) 



where tj is the true abundance of species j and wj the 
estimated abundance. 



Table 2. Measures of estimation accuracy of relative 
abundances for the FAMeS simHC dataset. 



Method 


RRMSE 


AVGRE 


metaMix 


17 


8.5 


Pathoscope 


36.6 


29.7 


MEGAN 


35.9 


18 



Importance of read support parameter We then assessed the 
importance of the read support parameter r on the output of 
metaMix. We ran metaMix on the benchmark simHC FAMeS 
dataset with r = { 10, 20, 30, 50} reads (Table[3] Figure^. We 
observe that as r decreases, a few more related strains from the 
reference database that are not in the community are retained 
in the output. As r increases two similar strains are merged 
into one. 

We compared these results with the output of Pathoscope 
and MEGAN. None of these methods have a read support 
parameter serving the same purpose as metaMix, so we tuned 
the most relevant parameters in these tools. Pathoscope has 
a thetaPrior parameter that enforces a unique read penalty. 
This parameter represents the read pseudocounts for the non- 
unique matches and the default setting is zero which allows for 
non informative priors. Using the default setting Pathoscope 
identifies 47 taxa. When thetaP's value is in (1,7) it identifies 
22 taxa, while with thetaP>7 it identifies 165. With this 
latter setting which is the one we chose for the comparison, 
Pathoscope behaves as a standard mixture model. 

MEGAN has a "Min Support" parameter which sets a 
threshold for the number of reads that must be assigned to 
a taxon so that it appears in the result. Any read assigned 
to a taxon not having the required support is pushed up the 
taxonomy until a taxon is found that has sufficient support. 
We used Min support = {10, 20, 30, 50} reads. The respective 
number of taxa in the summary files were 250, 243, 236, 232. 

We then also applied a post-run read count threshold to 
both methods' output summary. We set the threshold for 
10,20,30 reads respectively, disregarding taxa that have less 
than that number of reads assigned to them. In all instances 
metaMix produces a community profile closer to the real one. 
Pathoscope finds ^20 more false positives while MEGAN 
^40 more compared to metaMix at the same read support 
level. 

The simHC from the FAMeS datasets is a highly 
complicated dataset, not very similar to the typical human 
clinical sample. The differences are the large number of 
organisms, the presence of closely related strains of similar 
abundances, as well as the lack of viruses. Nevertheless, it is 
an essential dataset to use as a benchmark for examining the 
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Figure 3. Number of species in the simulated simHC FAMeS dataset 
detected by metaMix, Pathoscope and MEGAN, as a function of the minimum 
number of reads required for each species to appear in the output. 
For metaMix that is r={ 10, 20, 30, 50} reads, for Pathoscope thetaPrior> 7+ 
post-run threshold ={10, 20, 30, 50} reads, for MEGAN "Min Support" + 
post-run threshold ={10, 20, 30, 50} reads. 



performance of the methods in a situation of closely related 
strains in the sample. 

Human clinical sample - low viral load 

To test metaMix in a clinical setting with a low viral load, we 
used a brain biopsy RNA-Seq dataset from an undiagnosed 
encephalitis patient (UCL Hospital, data provided as part 
of a collaboration with Professor Breuer, UCL). Total 
RNA was purified from the biopsy and polyA RNA was 
separated for sequencing library preparation. The Illumina 
Miseq instrument generated 20 million paired-end reads. We 
processed the raw data using the bioinformatics pipeline 
described above. Subsequently we were left with ~ 75,000 
non-host reads and contigs. Based on the BLASTx output 
there were 1,298 potential species. 

metaMix Following the initial processing, we used metaMix 
for species identification and abundance estimation. We first 
fitted the all-species mixture model which identified 57 
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Table 4. Human clinical sample - novel virus. 
Comparison of community profile: metaMix - Pathoscope. 
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Figure 4. Human clinical sample - novel virus. 

The reads (blue lines) assigned by metaMix to Astrovirus VA1, aligned to its genome. The purple lines represent the genes of the 
virus. 



species having one or more reads assigned to them. We then 
explored this reduced state-space. The resulting species profile 
is shown in Table [2j the 12 metaMix entries correspond to 
9 species. metaMix identified two viruses confidently. The 
most abundant virus was the phiX bacteriophage: 0X174 
bacteriophage is routinely used for quality control of the 
Illumina high throughput sequencing. Five short assembled 
contigs (44 reads) with length ranging between 167bp and 
471bp and two non-assembled reads (46 reads overall) were 



assigned to the Astrovirus VA1 sequence with 97% sequence 
identity and a probability score of 1 (Figure [4}. metaMix 
also identified a number of bacteria supported by a few 
reads. These are either known laboratory reagent contaminants 
or human skin associated contaminants d27l l. The analysis 
completed in 20 minutes. 

After adding the Astrovirus VA1 and the bacteriophage 
phiX species in our mixture model, we found no evidence 
for additional viruses in the data. The presence of the 
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Astrovirus was confirmed using real-time RT-PCR (Lockwood 
et al, submitted). Sequencing of the Astrovirus genome with 
primers designed using the PCR product confirmed the fact 
that we identified a novel species. 

Pathoscope Pathoscope identified 22 taxa, corresponding to 
15 species and some genera/families (Table It also 
assigned all 46 reads to the Astrovirus. Almost all the species 
identified from metaMix were identified by Pathoscope, with 
an additional 9 taxa supported by few reads. As the method can 
only properly work with unassembled sequence data, an extra 
BLASTx similarity step had to be performed for the 91,516 
reads that had contributed to the 679 assembled contigs. After 
that Pathoscope produced the results in less than one minute. 

MEGAN MEGAN identified 19 taxa and did not detect the 
Astrovirus signal. We modified the minimum read support 
parameter from 50 reads to 10 to increase sensitivity. MEGAN 
then identified 25 taxa, including the Astrovirus VA1. The 
remaining 24 were mostly genera, relevant to the species 
detected by metaMix and Pathoscope. MEGAN produced the 
results in less than one minute. 

Human clinical sample - species absent from the database 

We then compared the performance of the three methods in 
a scenario where sequences present in the sample are absent 
from our reference database. 

We analyzed a second brain biopsy sample from an 
undiagnosed patient. 32 million RNA-Seq reads were obtained 
using the HiSeq instrument. Following initial processing using 
our bionformatics pipeline, we were left with 1,261,575 non 
host reads and contigs for subsequent analyses. There were 
3,150 potential species based on the BLASTx output. 

metaMix The all-species mixture model identified 80 species 
having one or more reads assigned to them. We then explored 
this reduced state-space. The metaMix resulting species 
profile consisted of 7 species (Supplementary Table S2). 
Two different coronaviruses strains were identified: Human 
coronavirus OC43 and Bovine coronavirus, with almost half 
a million reads assigned to each. The high read count assigned 
to both coronaviruses, as well as the similar read classification 
probabilities (Supplementary figure SI) suggest a single but 
novel strain which is not included in our reference database. 

We followed up on the sequences assigned to the 
"unknown category", i.e approximately 155K reads, by 
looking for nucleotide similarity with NR-NT using BLASTn. 
Approximately half of the reads originated from an 
untranslated region of the Coronavirus genome, which is not 
captured by the protein reference database. The remaining 
reads matched confidently to either Danio rerio (zebrafish) 
sequences or Gallus gallus (chicken), two organisms whose 
proteins are not in the human microbiome reference we are 
using. The zebrafish and chicken matches were explained 
as barcode leakage resulting from multiplexing on the same 
flowcell zebrafish and chicken RNA-Seq libraries. metaMix 
appropriately assigned these reads to the "unknown" category, 
producing a clean probabilistic summary (Supplementary 
Table S2). The method ran in 7.5 hours. 

In this instance, the metaMix results emphasize the 
importance of being able to deal with missing reference 



sequences that do not have a closely related strain or species 
in the same database. 

Pathoscope Pathoscope identified 177 species in this sample 
(Supplementary Table S2). We optimized the value of the 
unique read penalty parameter and we achieved the best 
results with the thetaPrior parameter set within the range 10- 
100. With these settings, the method identified 52 species 
(Supplementary Table S2). Our assessment is that Pathoscope 
is confused by the lack of completeness of databases 
combined with the absence of an "unknown" category, which 
prevents it from dealing with these unassigned reads sensibly. 
Pathoscope completed its analysis in 10 minutes. 

MEGAN MEGAN assigned the reads to 30 taxa. These 
included some species and genera but most were families 
(Supplementary Table S2). Approximately 250K reads could 
not be assigned to any taxonomic level. MEGAN run in 8 
minutes. 

DISCUSSION 

In this work we present metaMix, a sensitive method 
for metagenomic species identification and abundance 
estimation. The method is implemented in a R package 
(package in preparation and to be released on CRAN, 
scripts currently available at |http : //github . com/| 
|smorf opoulou /metaMix] ). Using a Bayesian mixture 
model framework, we account for model uncertainty by 
performing model averaging and we resolve ambiguous 
assignments by considering all reads simultaneously. A 
key feature of the method is that it provides probabilities 
that answer pertinent biological questions, in particular the 
posterior probability for the presence of a species in the 
mixture. Additionally it accurately quantifies the relative 
proportions of the organisms. 

This general framework is designed to address 
interpretation issues associated with closely related strains 
in the sample, low abundance organisms and absence of 
genomes from the reference database. We show that metaMix 
outperforms other methods in the community profiling task, 
particularly when complex structures with closely related 
strains are studied. It also outperforms the other methods at 
the task of relative abundance estimation. The method can 
deal with either unassembled reads or assembled contigs or 
both, allowing for flexibility of choice for the bioinformatics 
pre-processing. In practice, the choice of bioinformatics 
processing prior to the application of our Bayesian mixture 
analysis must be optimized for each application, and our 
processing pipeline has been designed with viral sequence 
identification from transcriptome sequencing as a main goal. 
Nevertheless, as demonstrated by our analysis of the artificial 
bacterial community dataset, the method can be applied in 
other contexts. 

The sensitivity and general applicability of metaMix comes 
at an increased computational cost, requiring access to a 
multi-core computer to run efficiently. For the datasets 
presented here, the computation time remained manageable 
and did not exceed a few hours, using 12 cores to run 
12 parallel chains. Nevertheless, a limitation of metaMix is 
the increased processing time for very large datasets. Speed 
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related improvements can be implemented in scenarios where 
the species ambiguity concerns only a small proportion of 
the read set. Reads with certain assignments can be flagged 
prior to the MCMC exploration of the state-space. Then only 
their assignment information can be carried forward, thereby 
reducing the size of the relevant R object. 

metaMix effectively complements faster but more 
approximate methods designed to perform similar tasks. It is 
most useful for complex datasets for which the interpretation 
is challenging. 
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Supplementary Text, Supplementary tables SI, S2, S3, 
Supplementary figure SI. 
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